{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_02_4_pandas_functional.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# T81-558: Applications of Deep Neural Networks\n",
    "\n",
    "**Module 2: Python for Machine Learning**\n",
    "\n",
    "- Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)\n",
    "- For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 2 Material\n",
    "\n",
    "Main video lecture:\n",
    "\n",
    "- Part 2.1: Introduction to Pandas [[Video]](https://www.youtube.com/watch?v=wixHCvnvnsU&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_1_python_pandas.ipynb)\n",
    "- Part 2.2: Categorical Values [[Video]](https://www.youtube.com/watch?v=Fm7Ax23hDP0&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_2_pandas_cat.ipynb)\n",
    "- Part 2.3: Grouping, Sorting, and Shuffling in Python Pandas [[Video]](https://www.youtube.com/watch?v=tUhaD8xWd7k&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_3_pandas_grouping.ipynb)\n",
    "- **Part 2.4: Using Apply and Map in Pandas** [[Video]](https://www.youtube.com/watch?v=YNo_mg1RrkM&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_4_pandas_functional.ipynb)\n",
    "- Part 2.5: Feature Engineering in Pandas for Deep Learning in PyTorch [[Video]](https://www.youtube.com/watch?v=ezaVtM405Qs&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_5_pandas_features.ipynb)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Google CoLab Instructions\n",
    "\n",
    "The following code ensures that Google CoLab is running the correct version of TensorFlow.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: using Google CoLab\n"
     ]
    }
   ],
   "source": [
    "try:\n",
    "    COLAB = True\n",
    "    print(\"Note: using Google CoLab\")\n",
    "except:\n",
    "    print(\"Note: not using Google CoLab\")\n",
    "    COLAB = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2.4: Apply and Map\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you've ever worked with Big Data or functional programming languages before, you've likely heard of map/reduce. Map and reduce are two functions that apply a task you create to a data frame. Pandas supports functional programming techniques that allow you to use functions across en entire data frame. In addition to functions that you write, Pandas also provides several standard functions for use with data frames.\n",
    "\n",
    "## Using Map with Dataframes\n",
    "\n",
    "The map function allows you to transform a column by mapping certain values in that column to other values. Consider the Auto MPG data set that contains a field **origin_name** that holds a value between one and three that indicates the geographic origin of each car. We can see how to use the map function to transform this numeric origin into the textual name of each origin.\n",
    "\n",
    "We will begin by loading the Auto MPG data set.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mpg</th>\n",
       "      <th>cylinders</th>\n",
       "      <th>displacement</th>\n",
       "      <th>...</th>\n",
       "      <th>year</th>\n",
       "      <th>origin</th>\n",
       "      <th>name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>18.0</td>\n",
       "      <td>8</td>\n",
       "      <td>307.0</td>\n",
       "      <td>...</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>chevrolet chevelle malibu</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15.0</td>\n",
       "      <td>8</td>\n",
       "      <td>350.0</td>\n",
       "      <td>...</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>buick skylark 320</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>396</th>\n",
       "      <td>28.0</td>\n",
       "      <td>4</td>\n",
       "      <td>120.0</td>\n",
       "      <td>...</td>\n",
       "      <td>82</td>\n",
       "      <td>1</td>\n",
       "      <td>ford ranger</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>397</th>\n",
       "      <td>31.0</td>\n",
       "      <td>4</td>\n",
       "      <td>119.0</td>\n",
       "      <td>...</td>\n",
       "      <td>82</td>\n",
       "      <td>1</td>\n",
       "      <td>chevy s-10</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>398 rows × 9 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      mpg  cylinders  displacement  ...  year  origin  \\\n",
       "0    18.0          8         307.0  ...    70       1   \n",
       "1    15.0          8         350.0  ...    70       1   \n",
       "..    ...        ...           ...  ...   ...     ...   \n",
       "396  28.0          4         120.0  ...    82       1   \n",
       "397  31.0          4         119.0  ...    82       1   \n",
       "\n",
       "                          name  \n",
       "0    chevrolet chevelle malibu  \n",
       "1            buick skylark 320  \n",
       "..                         ...  \n",
       "396                ford ranger  \n",
       "397                 chevy s-10  \n",
       "\n",
       "[398 rows x 9 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\n",
    "    \"https://data.heatonresearch.com/data/t81-558/auto-mpg.csv\", na_values=[\"NA\", \"?\"]\n",
    ")\n",
    "\n",
    "pd.set_option(\"display.max_columns\", 7)\n",
    "pd.set_option(\"display.max_rows\", 5)\n",
    "\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **map** method in Pandas operates on a single column. You provide **map** with a dictionary of values to transform the target column. The map keys specify what values in the target column should be turned into values specified by those keys. The following code shows how the map function can transform the numeric values of 1, 2, and 3 into the string values of North America, Europe, and Asia.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mpg</th>\n",
       "      <th>cylinders</th>\n",
       "      <th>displacement</th>\n",
       "      <th>...</th>\n",
       "      <th>origin</th>\n",
       "      <th>name</th>\n",
       "      <th>origin_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>272</th>\n",
       "      <td>23.8</td>\n",
       "      <td>4</td>\n",
       "      <td>151.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>oldsmobile starfire sx</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>249</th>\n",
       "      <td>19.9</td>\n",
       "      <td>8</td>\n",
       "      <td>260.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>oldsmobile cutlass salon brougham</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>201</th>\n",
       "      <td>18.5</td>\n",
       "      <td>6</td>\n",
       "      <td>250.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>pontiac ventura sj</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>317</th>\n",
       "      <td>34.3</td>\n",
       "      <td>4</td>\n",
       "      <td>97.0</td>\n",
       "      <td>...</td>\n",
       "      <td>2</td>\n",
       "      <td>audi 4000</td>\n",
       "      <td>Europe</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>25.0</td>\n",
       "      <td>4</td>\n",
       "      <td>110.0</td>\n",
       "      <td>...</td>\n",
       "      <td>2</td>\n",
       "      <td>peugeot 504</td>\n",
       "      <td>Europe</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>214</th>\n",
       "      <td>13.0</td>\n",
       "      <td>8</td>\n",
       "      <td>302.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>ford f108</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>303</th>\n",
       "      <td>31.8</td>\n",
       "      <td>4</td>\n",
       "      <td>85.0</td>\n",
       "      <td>...</td>\n",
       "      <td>3</td>\n",
       "      <td>datsun 210</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>281</th>\n",
       "      <td>19.8</td>\n",
       "      <td>6</td>\n",
       "      <td>200.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>mercury zephyr 6</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>130</th>\n",
       "      <td>26.0</td>\n",
       "      <td>4</td>\n",
       "      <td>122.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>ford pinto</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92</th>\n",
       "      <td>13.0</td>\n",
       "      <td>8</td>\n",
       "      <td>351.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>ford ltd</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>398 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      mpg  cylinders  displacement  ...  origin  \\\n",
       "272  23.8          4         151.0  ...       1   \n",
       "249  19.9          8         260.0  ...       1   \n",
       "201  18.5          6         250.0  ...       1   \n",
       "317  34.3          4          97.0  ...       2   \n",
       "20   25.0          4         110.0  ...       2   \n",
       "..    ...        ...           ...  ...     ...   \n",
       "214  13.0          8         302.0  ...       1   \n",
       "303  31.8          4          85.0  ...       3   \n",
       "281  19.8          6         200.0  ...       1   \n",
       "130  26.0          4         122.0  ...       1   \n",
       "92   13.0          8         351.0  ...       1   \n",
       "\n",
       "                                  name    origin_name  \n",
       "272             oldsmobile starfire sx  North America  \n",
       "249  oldsmobile cutlass salon brougham  North America  \n",
       "201                 pontiac ventura sj  North America  \n",
       "317                          audi 4000         Europe  \n",
       "20                         peugeot 504         Europe  \n",
       "..                                 ...            ...  \n",
       "214                          ford f108  North America  \n",
       "303                         datsun 210           Asia  \n",
       "281                   mercury zephyr 6  North America  \n",
       "130                         ford pinto  North America  \n",
       "92                            ford ltd  North America  \n",
       "\n",
       "[398 rows x 10 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Apply the map\n",
    "df[\"origin_name\"] = df[\"origin\"].map({1: \"North America\", 2: \"Europe\", 3: \"Asia\"})\n",
    "\n",
    "# Shuffle the data, so that we hopefully see\n",
    "# more regions.\n",
    "df = df.reindex(np.random.permutation(df.index))\n",
    "\n",
    "# Display\n",
    "pd.set_option(\"display.max_columns\", 7)\n",
    "pd.set_option(\"display.max_rows\", 10)\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Apply with Dataframes\n",
    "\n",
    "The **apply** function of the data frame can run a function over the entire data frame. You can use either a traditional named function or a lambda function. Python will execute the provided function against each of the rows or columns in the data frame. The **axis** parameter specifies that the function is run across rows or columns. For axis = 1, rows are used. The following code calculates a series called **efficiency** that is the **displacement** divided by **horsepower**.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "272    1.776471\n",
       "249    2.363636\n",
       "201    2.272727\n",
       "317    1.243590\n",
       "20     1.264368\n",
       "229    2.222222\n",
       "211    1.400000\n",
       "1      2.121212\n",
       "52     1.157895\n",
       "203    1.366197\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "efficiency = df.apply(lambda x: x[\"displacement\"] / x[\"horsepower\"], axis=1)\n",
    "display(efficiency[0:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can now insert this series into the data frame, either as a new column or to replace an existing column. The following code inserts this new series into the data frame.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"efficiency\"] = efficiency"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Engineering with Apply and Map\n",
    "\n",
    "In this section, we will see how to calculate a complex feature using map, apply, and grouping. The data set is the following CSV:\n",
    "\n",
    "- https://www.irs.gov/pub/irs-soi/16zpallagi.csv\n",
    "\n",
    "This URL contains US Government public data for \"SOI Tax Stats - Individual Income Tax Statistics.\" The entry point to the website is here:\n",
    "\n",
    "- https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2016-zip-code-data-soi\n",
    "\n",
    "Documentation describing this data is at the above link.\n",
    "\n",
    "For this feature, we will attempt to estimate the adjusted gross income (AGI) for each of the zip codes. The data file contains many columns; however, you will only use the following:\n",
    "\n",
    "- **STATE** - The state (e.g., MO)\n",
    "- **zipcode** - The zipcode (e.g. 63017)\n",
    "- **agi_stub** - Six different brackets of annual income (1 through 6)\n",
    "- **N1** - The number of tax returns for each of the agi_stubs\n",
    "\n",
    "Note, that the file will have six rows for each zip code for each of the agi_stub brackets. You can skip zip codes with 0 or 99999.\n",
    "\n",
    "We will create an output CSV with these columns; however, only one row per zip code. Calculate a weighted average of the income brackets. For example, the following six rows are present for 63017:\n",
    "\n",
    "| zipcode | agi_stub | N1   |\n",
    "| ------- | -------- | ---- |\n",
    "| 63017   | 1        | 4710 |\n",
    "| 63017   | 2        | 2780 |\n",
    "| 63017   | 3        | 2130 |\n",
    "| 63017   | 4        | 2010 |\n",
    "| 63017   | 5        | 5240 |\n",
    "| 63017   | 6        | 3510 |\n",
    "\n",
    "We must combine these six rows into one. For privacy reasons, AGI's are broken out into 6 buckets. We need to combine the buckets and estimate the actual AGI of a zipcode. To do this, consider the values for N1:\n",
    "\n",
    "- 1 = 1 to 25,000\n",
    "- 2 = 25,000 to 50,000\n",
    "- 3 = 50,000 to 75,000\n",
    "- 4 = 75,000 to 100,000\n",
    "- 5 = 100,000 to 200,000\n",
    "- 6 = 200,000 or more\n",
    "\n",
    "The median of each of these ranges is approximately:\n",
    "\n",
    "- 1 = 12,500\n",
    "- 2 = 37,500\n",
    "- 3 = 62,500\n",
    "- 4 = 87,500\n",
    "- 5 = 112,500\n",
    "- 6 = 212,500\n",
    "\n",
    "Using this, you can estimate 63017's average AGI as:\n",
    "\n",
    "```\n",
    ">>> totalCount = 4710 + 2780 + 2130 + 2010 + 5240 + 3510\n",
    ">>> totalAGI = 4710 * 12500 + 2780 * 37500 + 2130 * 62500\n",
    "    + 2010 * 87500 + 5240 * 112500 + 3510 * 212500\n",
    ">>> print(totalAGI / totalCount)\n",
    "\n",
    "88689.89205103042\n",
    "```\n",
    "\n",
    "We begin by reading the government data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\"https://www.irs.gov/pub/irs-soi/16zpallagi.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we trim all zip codes that are either 0 or 99999. We also select the three fields that we need.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>STATE</th>\n",
       "      <th>zipcode</th>\n",
       "      <th>agi_stub</th>\n",
       "      <th>N1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>AL</td>\n",
       "      <td>35004</td>\n",
       "      <td>1</td>\n",
       "      <td>1510</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>AL</td>\n",
       "      <td>35004</td>\n",
       "      <td>2</td>\n",
       "      <td>1410</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>AL</td>\n",
       "      <td>35004</td>\n",
       "      <td>3</td>\n",
       "      <td>950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>AL</td>\n",
       "      <td>35004</td>\n",
       "      <td>4</td>\n",
       "      <td>650</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>AL</td>\n",
       "      <td>35004</td>\n",
       "      <td>5</td>\n",
       "      <td>630</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179785</th>\n",
       "      <td>WY</td>\n",
       "      <td>83414</td>\n",
       "      <td>2</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179786</th>\n",
       "      <td>WY</td>\n",
       "      <td>83414</td>\n",
       "      <td>3</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179787</th>\n",
       "      <td>WY</td>\n",
       "      <td>83414</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179788</th>\n",
       "      <td>WY</td>\n",
       "      <td>83414</td>\n",
       "      <td>5</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179789</th>\n",
       "      <td>WY</td>\n",
       "      <td>83414</td>\n",
       "      <td>6</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>179184 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       STATE  zipcode  agi_stub    N1\n",
       "6         AL    35004         1  1510\n",
       "7         AL    35004         2  1410\n",
       "8         AL    35004         3   950\n",
       "9         AL    35004         4   650\n",
       "10        AL    35004         5   630\n",
       "...      ...      ...       ...   ...\n",
       "179785    WY    83414         2    40\n",
       "179786    WY    83414         3    40\n",
       "179787    WY    83414         4     0\n",
       "179788    WY    83414         5    40\n",
       "179789    WY    83414         6    30\n",
       "\n",
       "[179184 rows x 4 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df = df.loc[\n",
    "    (df[\"zipcode\"] != 0) & (df[\"zipcode\"] != 99999),\n",
    "    [\"STATE\", \"zipcode\", \"agi_stub\", \"N1\"],\n",
    "]\n",
    "\n",
    "pd.set_option(\"display.max_columns\", 0)\n",
    "pd.set_option(\"display.max_rows\", 10)\n",
    "\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We replace all of the **agi_stub** values with the correct median values with the **map** function.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>STATE</th>\n",
       "      <th>zipcode</th>\n",
       "      <th>agi_stub</th>\n",
       "      <th>N1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>AL</td>\n",
       "      <td>35004</td>\n",
       "      <td>12500</td>\n",
       "      <td>1510</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>AL</td>\n",
       "      <td>35004</td>\n",
       "      <td>37500</td>\n",
       "      <td>1410</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>AL</td>\n",
       "      <td>35004</td>\n",
       "      <td>62500</td>\n",
       "      <td>950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>AL</td>\n",
       "      <td>35004</td>\n",
       "      <td>87500</td>\n",
       "      <td>650</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>AL</td>\n",
       "      <td>35004</td>\n",
       "      <td>112500</td>\n",
       "      <td>630</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179785</th>\n",
       "      <td>WY</td>\n",
       "      <td>83414</td>\n",
       "      <td>37500</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179786</th>\n",
       "      <td>WY</td>\n",
       "      <td>83414</td>\n",
       "      <td>62500</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179787</th>\n",
       "      <td>WY</td>\n",
       "      <td>83414</td>\n",
       "      <td>87500</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179788</th>\n",
       "      <td>WY</td>\n",
       "      <td>83414</td>\n",
       "      <td>112500</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179789</th>\n",
       "      <td>WY</td>\n",
       "      <td>83414</td>\n",
       "      <td>212500</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>179184 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       STATE  zipcode  agi_stub    N1\n",
       "6         AL    35004     12500  1510\n",
       "7         AL    35004     37500  1410\n",
       "8         AL    35004     62500   950\n",
       "9         AL    35004     87500   650\n",
       "10        AL    35004    112500   630\n",
       "...      ...      ...       ...   ...\n",
       "179785    WY    83414     37500    40\n",
       "179786    WY    83414     62500    40\n",
       "179787    WY    83414     87500     0\n",
       "179788    WY    83414    112500    40\n",
       "179789    WY    83414    212500    30\n",
       "\n",
       "[179184 rows x 4 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "medians = {1: 12500, 2: 37500, 3: 62500, 4: 87500, 5: 112500, 6: 212500}\n",
    "df[\"agi_stub\"] = df.agi_stub.map(medians)\n",
    "\n",
    "pd.set_option(\"display.max_columns\", 0)\n",
    "pd.set_option(\"display.max_rows\", 10)\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we group the data frame by zip code.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "groups = df.groupby(by=\"zipcode\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The program applies a lambda across the groups and calculates the AGI estimate.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>zipcode</th>\n",
       "      <th>0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1001</td>\n",
       "      <td>52895.322940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1002</td>\n",
       "      <td>64528.451001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1003</td>\n",
       "      <td>15441.176471</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1005</td>\n",
       "      <td>54694.092827</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1007</td>\n",
       "      <td>63654.353562</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29867</th>\n",
       "      <td>99921</td>\n",
       "      <td>48042.168675</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29868</th>\n",
       "      <td>99922</td>\n",
       "      <td>32954.545455</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29869</th>\n",
       "      <td>99925</td>\n",
       "      <td>45639.534884</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29870</th>\n",
       "      <td>99926</td>\n",
       "      <td>41136.363636</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29871</th>\n",
       "      <td>99929</td>\n",
       "      <td>45911.214953</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>29872 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       zipcode             0\n",
       "0         1001  52895.322940\n",
       "1         1002  64528.451001\n",
       "2         1003  15441.176471\n",
       "3         1005  54694.092827\n",
       "4         1007  63654.353562\n",
       "...        ...           ...\n",
       "29867    99921  48042.168675\n",
       "29868    99922  32954.545455\n",
       "29869    99925  45639.534884\n",
       "29870    99926  41136.363636\n",
       "29871    99929  45911.214953\n",
       "\n",
       "[29872 rows x 2 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df = pd.DataFrame(\n",
    "    groups.apply(lambda x: sum(x[\"N1\"] * x[\"agi_stub\"]) / sum(x[\"N1\"]))\n",
    ").reset_index()\n",
    "\n",
    "pd.set_option(\"display.max_columns\", 0)\n",
    "pd.set_option(\"display.max_rows\", 10)\n",
    "\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now rename the new **agi_estimate** column.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>zipcode</th>\n",
       "      <th>agi_estimate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1001</td>\n",
       "      <td>52895.322940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1002</td>\n",
       "      <td>64528.451001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1003</td>\n",
       "      <td>15441.176471</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1005</td>\n",
       "      <td>54694.092827</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1007</td>\n",
       "      <td>63654.353562</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29867</th>\n",
       "      <td>99921</td>\n",
       "      <td>48042.168675</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29868</th>\n",
       "      <td>99922</td>\n",
       "      <td>32954.545455</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29869</th>\n",
       "      <td>99925</td>\n",
       "      <td>45639.534884</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29870</th>\n",
       "      <td>99926</td>\n",
       "      <td>41136.363636</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29871</th>\n",
       "      <td>99929</td>\n",
       "      <td>45911.214953</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>29872 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       zipcode  agi_estimate\n",
       "0         1001  52895.322940\n",
       "1         1002  64528.451001\n",
       "2         1003  15441.176471\n",
       "3         1005  54694.092827\n",
       "4         1007  63654.353562\n",
       "...        ...           ...\n",
       "29867    99921  48042.168675\n",
       "29868    99922  32954.545455\n",
       "29869    99925  45639.534884\n",
       "29870    99926  41136.363636\n",
       "29871    99929  45911.214953\n",
       "\n",
       "[29872 rows x 2 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df.columns = [\"zipcode\", \"agi_estimate\"]\n",
    "\n",
    "pd.set_option(\"display.max_columns\", 0)\n",
    "pd.set_option(\"display.max_rows\", 10)\n",
    "\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we check to see that our zip code of 63017 got the correct value.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>zipcode</th>\n",
       "      <th>agi_estimate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>19909</th>\n",
       "      <td>63017</td>\n",
       "      <td>88689.892051</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       zipcode  agi_estimate\n",
       "19909    63017  88689.892051"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df[\"zipcode\"] == 63017]"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3.9 (torch)",
   "language": "python",
   "name": "pytorch"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
